This tutorial introduces the wikiimage.py module, which we can use to grab
and process image data from Wikipedia pages. Start by reading in the module,
as well as numpy and matplotlib (for plotting the images).
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import wiki
import wikiimage
import wikitext
plt.rcParams["figure.figsize"] = (12, 16)
The image_data_frame function takes a list of Wikipedia pages and returns a data
frame object listing all of the images from those pages. You can also supply the
minimum and maximum allowed sizes of images. By default the function will download
any images you do not yet have locally.
df = wikiimage.image_data_frame(['Paris', 'London'], min_size=300)
df
Note that the returned data frame includes the page name, the path of the image, and a column called "max_size". The latter gives the size of the image's largest dimension (either the height or the width).
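For instance, a data frame of this shape can be filtered on max_size with ordinary pandas operations. This is just a sketch with a made-up stand-in for the real output (the page and image names below are hypothetical, not actual results from image_data_frame):

```python
import pandas as pd

# Hypothetical stand-in for the data frame returned by image_data_frame.
df_example = pd.DataFrame({
    "page": ["Paris", "Paris", "London"],
    "img": ["a.jpg", "b.jpg", "c.jpg"],
    "max_size": [640, 300, 1024],
})

# Keep only images whose largest dimension is at least 500 pixels.
large = df_example[df_example["max_size"] >= 500]
print(list(large["img"]))  # ['a.jpg', 'c.jpg']
```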
The load_image function takes the name of an image and returns a PIL object,
a special image type that can be plotted in Python.
img = wikiimage.load_image(df.img.values[4])
type(img)
plt.imshow(img)
Here is some Python code that plots all of the images in the data frame. Note
that you may need to modify the line plt.subplot(4, 3, ind + 1) if you change
the data: the 4 gives the number of rows in the plot and the 3 gives the number
of columns. If you have more than 12 images, only the first 12 will be shown. You can
also adjust the plt.rcParams["figure.figsize"] = (12, 16) above to change the overall
size of the output (I find that I need to adjust this depending on my screen and
the images in question).
for ind in range(df.shape[0]):
    try:
        plt.subplots_adjust(left=0, right=1, bottom=0, top=1)
        plt.subplot(4, 3, ind + 1)
        img = wikiimage.load_image(df.iloc[ind]['img'])
        plt.imshow(img)
        plt.axis("off")
    except Exception:
        pass
Last time we saw how the VGG19 model takes a 224-by-224 dimensional image and returns a list of 1000 probabilities giving predictions of what objects are located in the image. Here's the model once again:
from keras.applications.vgg19 import VGG19
vgg19_full = VGG19(weights='imagenet')
vgg19_full.summary()
The VGG19 model as described here is really only useful if we care about the 1000 categories described in the ILSVRC competition. Why would this be important enough to include in the keras module? In and of itself, it really is not. The reason the model is so important is due to something called transfer learning.
It turns out that if we apply only a subset of the layers, say all but the final layer
of the model, the neural network serves as a form of dimensionality reduction. Look at
the model above: taking the output of the layer fc2 projects a
224 * 224 * 3, or 150,528-dimensional, object into a 4096-dimensional space. To
produce such an embedding, I'll use keras to truncate the model at fc2, stripping off the final layer:
from keras.models import Model
vgg_fc2 = Model(inputs=vgg19_full.input, outputs=vgg19_full.get_layer('fc2').output)
vgg_fc2.summary()
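As a quick check on the arithmetic behind this reduction (plain Python; nothing assumed beyond the dimensions quoted above):

```python
# Input dimensionality of a VGG19 image: height * width * channels.
input_dim = 224 * 224 * 3
print(input_dim)  # 150528

# The fc2 embedding compresses this by a factor of about 37.
print(input_dim / 4096)  # 36.75
```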
And we can apply the model just as we did before, but the output now contains 4096 dimensions. These dimensions, just like with PCA and t-SNE, do not have an explicit meaning. The relationships between images in the embedding space, however, describe semantic relationships, which we will be able to explore shortly.
from keras.preprocessing import image
from keras.applications.vgg19 import preprocess_input
img = wikiimage.load_image(df.img.values[1], target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)
y = vgg_fc2.predict(x)
y.shape
The wikiimage module contains the function vgg19_embed, which performs the
embedding into the fc2 layer. Conveniently, the embeddings are cached
so that you only need to construct them once (creating the embeddings
can take a while).
df_fc2 = wikiimage.vgg19_embed(df.img.values)
df_fc2.shape
The output is a numpy array with one row for each image and 4096 columns. Again, we will see how to use these in just a moment.
As with the Wikipedia pages at the start of the semester, I do not want you all to have to wait a long time to download the images for today's class. Conveniently, we should be able to use the same bulk download function if we are clever about calling the "language" of the images "img" and the "language" of the embeddings "embed". Grab both of these here:
wiki.bulk_download('impressionists-text', lang='en')
wiki.bulk_download('impressionists-image', lang='img')
wiki.bulk_download('impressionists-embed', lang='embed')
For today's tutorial, let's create a dataset of all the pages linked to from the impressionists and extract from these all of the images. Note: you should have almost all of these from the bulk download above. If it starts downloading a lot of stuff, something is wrong!
page_links = wikitext.get_internal_links("Impressionism")['ilinks'] + ["Impressionism"]
df = wikiimage.image_data_frame(page_links, download=True, min_size=224, max_size=750)
df
Next, let's grab the VGG19 embeddings for these images. There is a lot to load, but this should only take a minute or two since almost all of the embeddings were already downloaded above.
wikiart_fc2 = wikiimage.vgg19_embed(df.img.values)
wikiart_fc2.shape
Now, finally, let's see why these embeddings are so useful. Let's start with image number 700:
start_img = 700
img = wikiimage.load_image(df.iloc[start_img]['img'])
plt.imshow(img)
We can compute the distance in the 4096-dimensional embedding space of this image to all of the other images in our corpus.
dists = np.sum(np.abs(wikiart_fc2 - wikiart_fc2[start_img, :]), 1)
dists.shape
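The code above uses the L1 (Manhattan) distance, summing the absolute differences across all 4096 dimensions. Other metrics, such as the Euclidean (L2) distance, follow the same broadcasting pattern. Here is a minimal numpy sketch on a small synthetic matrix (five fake "images" in four dimensions, not real embeddings):

```python
import numpy as np

# Synthetic stand-in for an embedding matrix: 5 "images", 4 dimensions.
emb = np.array([[0., 0., 0., 0.],
                [1., 0., 0., 0.],
                [3., 0., 0., 0.],
                [0., 2., 0., 0.],
                [1., 1., 1., 1.]])

start = 0
# L1 (Manhattan) distance, as used above.
l1 = np.sum(np.abs(emb - emb[start, :]), axis=1)
# Euclidean (L2) distance, a common alternative.
l2 = np.sqrt(np.sum((emb - emb[start, :]) ** 2, axis=1))
print(l1)  # [0. 1. 3. 2. 4.]
print(l2)  # [0. 1. 3. 2. 2.]
```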
Then, we'll sort these distances and get the indices of the 24 closest images in this space (of course, the closest image will be image 700 itself).
idx = np.argsort(dists.flatten())[:24]
idx
Finally, let's see all of the images in order from closest to farthest:
plt.figure(figsize=(14, 36))
for ind, i in enumerate(idx):
    try:
        plt.subplots_adjust(left=0, right=1, bottom=0, top=1)
        plt.subplot(8, 3, ind + 1)
        img = wikiimage.load_image(df.iloc[i]['img'])
        plt.imshow(img)
        plt.axis("off")
    except Exception:
        pass
Fairly accurate, when you consider all of the image types in the corpus, no?
The code below picks a random starting point and displays the 24 closest images in the
fc2 space. Run it multiple times and record particularly interesting image numbers. Where does
it work well and where does it run into problems? Tell me about at least one image number that
worked better than you expected and at least one case it had trouble dealing with:
Answer:
start_img = np.random.randint(0, df.shape[0])
print("Grabbed image number {0:d}.".format(start_img))
print(df.iloc[start_img])
dists = np.sum(np.abs(wikiart_fc2 - wikiart_fc2[start_img, :]), 1)
idx = np.argsort(dists.flatten())[:24]
plt.figure(figsize=(14, 36))
for ind, i in enumerate(idx):
    try:
        plt.subplots_adjust(left=0, right=1, bottom=0, top=1)
        plt.subplot(8, 3, ind + 1)
        img = wikiimage.load_image(df.iloc[i]['img'])
        plt.imshow(img)
        plt.axis("off")
    except Exception:
        pass